Data Validation in Excel and R

Author

Eric R. Scott

Published

August 24, 2022

Overview

There are many opportunities for human or instrument error to affect data. Ideally, you want to find those errors and fix them early and often! This workshop introduces some tools in Excel and R to avoid making mistakes in data entry and data collection, and to detect the ones you inadvertently make.

Learning Objectives

  • Understand some best practices for entering data

  • Use Excel data validation tools to prevent data entry errors

  • Use double-entry and anti_join() to check for data entry mistakes

  • Explore data summaries to check for errors

  • Be familiar with the basics of data validation with pointblank

We are only interested in fixing verifiable mistakes! It is generally not appropriate to remove or edit outliers—extreme values that may or may not be accurate.

Software needed

You’ll need access to Excel, R, RStudio, and the following R packages:

library(tidyverse)
library(visdat)
library(pointblank)
library(skimr)

If you can, install the development version of pointblank using this R code:

if(!require("remotes")) {
  install.packages("remotes")
} 
remotes::install_github("rich-iannone/pointblank")

Data Entry in Spreadsheets

A common and useful way to enter data is in a spreadsheet (e.g. in Excel or Google Sheets). Here are some tips for avoiding data entry mistakes and saving yourself time and headaches with data wrangling:

  • For categorical variables and notes use consistent values (e.g. don’t sometimes write “dead” and other times write “Dead”)
  • Explicitly record missing data instead of leaving cells blank
  • Use consistent, concise, descriptive, machine-readable column headers (this is hard!)
  • Use ISO (YYYY-MM-DD) format for dates
Tip

For more detailed guidelines on using spreadsheets for data entry, I highly recommend Data Organization in Spreadsheets by Broman and Woo (2017)
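To see why the ISO format pays off, a minimal base-R sketch (the date strings are invented examples):

```r
# ISO dates parse unambiguously, with no format string needed
as.Date("2016-07-01")

# Other formats are ambiguous: you must know which convention was used
as.Date("07/01/2016", format = "%m/%d/%Y")  # July 1 (US convention)
as.Date("07/01/2016", format = "%d/%m/%Y")  # January 7 (most other places)
```

ISO dates also sort correctly as plain text, which matters when sorting columns in a spreadsheet.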

Avoiding mistakes in data entry

Set up validation tools in your data entry spreadsheet to stop data entry errors in their tracks!

gif of cat typing furiously on a laptop

Data Validation Tools in Excel

Data Validation dialog box in Excel.  Settings tab is shown with dropdown menu for Allow:

  • Select a column (or cells) and choose Data > Validation … from the menu

  • Use “list” to restrict to specific values for categorical data

  • Use “whole number” for count data

  • Use “date” to restrict date ranges

  • Validation can also be set up after data entry: highlight invalid entries with “Circle Invalid Data” from the toolbar

Watch out for Excel autocorrect!

Nature headline: "Autocorrect errors in Excel still creating genomics headache. Despite geneticists being warned about spreadsheet problems, 30% of published papers contain mangled gene names in supplementary data."

Confused anime guy with butterfly meme where the guy has a Microsoft Excel logo on his face, the butterfly is "any data at all" and the caption is "is this a date?"

To stop Excel from converting entries to dates:

  1. Explicitly set all column types to numeric, text, date, etc.

  2. Make sure no columns are set to “general”
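The same discipline applies when reading data into R: readr guesses column types unless you declare them, which is R's version of leaving a column set to "general". A small sketch using inline CSV text (the data are invented):

```r
library(readr)

# inline CSV text standing in for a real file
csv <- "plant_id,field\n10,A\n11,B"

# declare every column type up front instead of letting readr guess
tea_demo <- read_csv(csv, col_types = cols(
  plant_id = col_integer(),
  field    = col_character()
))
```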

Double-entry Method

  • Two people enter the same data, then compare the two versions programmatically.

  • In the data folder, there are two versions of a dataset—one entered by Eric and one entered by Jessica.

eric <- read_csv("data/tea_eric.csv")
Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): field, counter
dbl  (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date  (1): date
time  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jessica <- read_csv("data/tea_jessica.csv")
Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): field, counter
dbl  (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date  (1): date
time  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note

This example dataset is modified from (Scott et al. 2021). Each day, researchers counted the number of leafhoppers (hoppers) on tea plants and measured the length of 7 labeled tea shoots in cm (recorded in shoot_* columns). leaves is how many leaves were searched for leafhoppers on each plant and counter is the initial of the researcher who took measurements.

Compare visually with visdat

We can compare them a couple of ways. First, we can compare them visually using the visdat package. This only works if the two datasets have the same dimensions.

vis_compare(eric, jessica)

Compare with dplyr::anti_join()

First add row numbers to make it easier to find mistakes in Excel.

# add rownumbers that match Excel (headers are row 1)
eric    <- eric    %>% mutate(row = 2:(n()+1), .before = date)
jessica <- jessica %>% mutate(row = 2:(n()+1), .before = date)

anti_join() takes two data frames and returns the rows of the first that have no exact match in the second, so running it in both directions reveals every discrepancy.
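The asymmetry is easier to see on a tiny invented example before running it on the real data:

```r
library(dplyr)

a <- data.frame(id = 1:3, height = c(4.2, 5.1, 6.0))
b <- data.frame(id = 1:3, height = c(4.2, 5.7, 6.0))  # typo in row 2

anti_join(a, b, by = c("id", "height"))  # row 2 as entered in `a` (5.1)
anti_join(b, a, by = c("id", "height"))  # row 2 as entered in `b` (5.7)
```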

#values in `eric` that are different in `jessica`
anti_join(eric, jessica)
Joining, by = c("row", "date", "field", "time", "plant_id", "shoot_1",
"shoot_2", "shoot_3", "shoot_4", "shoot_5", "shoot_6", "shoot_7", "hoppers",
"counter", "leaves")
# A tibble: 6 × 15
    row date       field time   plant_id shoot_1 shoot_2 shoot_3 shoot_4 shoot_5
  <int> <date>     <chr> <time>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    11 2016-07-01 A     06:19        10     6.4     2.5     3.8     1       4.6
2    28 2016-07-02 A     06:26         7     3.5     2.7     2.4     2.2     2.4
3    29 2016-07-02 A     06:26         8     8.1     3.8     1.8     6       5.7
4    49 2016-07-03 A     06:11         8     8.5     4.1     2.1     6.4     5.7
5    51 2016-07-03 A     06:11        10     7.9     3.2     5.4     8.6     5.6
6    61 2016-07-03 B     06:31        NA    NA      NA      NA      NA      NA  
# … with 5 more variables: shoot_6 <dbl>, shoot_7 <dbl>, hoppers <dbl>,
#   counter <chr>, leaves <dbl>
#values in `jessica` that are different in `eric`
anti_join(jessica, eric)
Joining, by = c("row", "date", "field", "time", "plant_id", "shoot_1",
"shoot_2", "shoot_3", "shoot_4", "shoot_5", "shoot_6", "shoot_7", "hoppers",
"counter", "leaves")
# A tibble: 6 × 15
    row date       field time   plant_id shoot_1 shoot_2 shoot_3 shoot_4 shoot_5
  <int> <date>     <chr> <time>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    11 2016-07-01 A     06:19        10     6.4     2.5     3.8     7       4.6
2    28 2016-07-02 A     06:26         7     3.5     2.7     2.4     2.2     2.4
3    29 2016-07-02 A     06:26         8     8.1     3.8     1.8     6       5.7
4    49 2016-07-03 A     06:11         8     8.5     4.1    21       6.4     5.7
5    51 2016-07-03 A     06:11        11     7.9     3.2     5.4     8.6     5.6
6    61 2016-07-03 B     06:31        20     5.8     6.8    16.6    NA       5.8
# … with 5 more variables: shoot_6 <dbl>, shoot_7 <dbl>, hoppers <dbl>,
#   counter <chr>, leaves <dbl>

Errors include:

  • row 11: messy handwriting? (1 or 7?)
  • row 28 & 29: values swapped for shoot_7
  • row 49: missing decimal point
  • row 51: discrepancy in plant ID
  • row 61: missing data in Eric’s version
#after fixing data-entry errors, we get `tea_resolved.csv`
tea <- read_csv("data/tea_resolved.csv")
Rows: 60 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): field, counter
dbl  (10): plant_id, shoot_1, shoot_2, shoot_3, shoot_4, shoot_5, shoot_6, s...
date  (1): date
time  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore data summaries

  • You can’t check for errors if you don’t get to know your data!
  • Use skimr::skim() to get a nicely formatted summary
  • Look for number of unique values for categorical variables
  • Look for long tails or strange patterns in mini-histograms for numeric variables
skimr::skim(tea)
Data summary
Name tea
Number of rows 60
Number of columns 14
_______________________
Column type frequency:
character 2
Date 1
difftime 1
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
field 0 1 1 1 0 2 0
counter 0 1 1 1 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-07-01 2016-07-03 2016-07-02 3

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
time 0 1 22260 secs 23460 secs 22920 secs 6

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plant_id 0 1.00 10.50 5.81 1.0 5.75 10.50 15.25 20.0 ▇▇▇▇▇
shoot_1 0 1.00 6.32 2.49 1.5 4.70 5.95 8.17 12.0 ▃▇▆▇▂
shoot_2 0 1.00 6.72 3.75 2.2 3.95 6.35 7.58 15.2 ▇▇▁▁▃
shoot_3 0 1.00 6.25 3.74 1.7 3.08 5.10 8.53 16.6 ▇▃▅▁▁
shoot_4 4 0.93 5.47 3.36 1.2 2.70 4.85 6.85 12.9 ▇▆▃▁▃
shoot_5 0 1.00 5.44 2.44 1.8 3.70 5.00 6.55 11.5 ▆▇▆▁▂
shoot_6 0 1.00 6.04 3.56 0.0 3.88 4.85 7.50 14.1 ▂▇▃▁▂
shoot_7 4 0.93 6.62 4.11 1.1 3.35 5.50 9.95 16.2 ▇▇▃▅▁
hoppers 0 1.00 2.84 2.77 0.0 1.00 2.00 4.00 15.0 ▇▃▁▁▁
leaves 0 1.00 36.67 9.51 30.0 30.00 30.00 50.00 50.0 ▇▁▁▁▃

Or get a more detailed breakdown by running skim() on a grouped data frame:

tea %>% 
  group_by(date) %>% 
  skim()
Data summary
Name Piped data
Number of rows 60
Number of columns 14
_______________________
Column type frequency:
character 2
difftime 1
numeric 10
________________________
Group variables date

Variable type: character

skim_variable date n_missing complete_rate min max empty n_unique whitespace
field 2016-07-01 0 1 1 1 0 2 0
field 2016-07-02 0 1 1 1 0 2 0
field 2016-07-03 0 1 1 1 0 2 0
counter 2016-07-01 0 1 1 1 0 4 0
counter 2016-07-02 0 1 1 1 0 2 0
counter 2016-07-03 0 1 1 1 0 2 0

Variable type: difftime

skim_variable date n_missing complete_rate min max median n_unique
time 2016-07-01 0 1 22740 secs 23100 secs 22920 secs 2
time 2016-07-02 0 1 22440 secs 23160 secs 22800 secs 2
time 2016-07-03 0 1 22260 secs 23460 secs 22860 secs 2

Variable type: numeric

skim_variable date n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plant_id 2016-07-01 0 1.00 10.50 5.92 1.0 5.75 10.50 15.25 20.0 ▇▇▇▇▇
plant_id 2016-07-02 0 1.00 10.50 5.92 1.0 5.75 10.50 15.25 20.0 ▇▇▇▇▇
plant_id 2016-07-03 0 1.00 10.50 5.92 1.0 5.75 10.50 15.25 20.0 ▇▇▇▇▇
shoot_1 2016-07-01 0 1.00 6.06 2.50 1.5 4.32 6.00 7.78 11.0 ▅▅▇▃▃
shoot_1 2016-07-02 0 1.00 6.34 2.61 1.5 4.65 6.00 8.17 11.5 ▃▇▅▅▃
shoot_1 2016-07-03 0 1.00 6.54 2.47 1.7 5.05 5.90 8.50 12.0 ▁▇▂▅▂
shoot_2 2016-07-01 0 1.00 6.43 3.73 2.2 3.32 6.10 7.58 14.0 ▇▇▂▁▅
shoot_2 2016-07-02 0 1.00 6.77 3.80 2.4 3.95 6.40 7.52 15.1 ▇▇▁▁▃
shoot_2 2016-07-03 0 1.00 6.94 3.90 2.3 4.00 6.55 7.60 15.2 ▇▇▁▁▃
shoot_3 2016-07-01 0 1.00 5.97 3.74 1.7 2.90 4.70 8.45 15.3 ▇▂▃▂▁
shoot_3 2016-07-02 0 1.00 6.22 3.78 1.8 3.17 4.95 8.55 15.5 ▇▃▃▂▁
shoot_3 2016-07-03 0 1.00 6.54 3.86 2.1 3.18 5.20 8.48 16.6 ▇▂▃▁▁
shoot_4 2016-07-01 1 0.95 5.07 3.35 1.2 2.50 3.90 6.35 12.0 ▇▆▅▁▃
shoot_4 2016-07-02 1 0.95 5.51 3.40 1.5 2.85 4.70 6.60 12.6 ▇▇▃▁▃
shoot_4 2016-07-03 2 0.90 5.84 3.49 1.7 3.20 5.40 7.07 12.9 ▇▆▂▁▃
shoot_5 2016-07-01 0 1.00 5.10 2.42 1.8 3.35 4.55 6.43 11.0 ▇▇▇▁▂
shoot_5 2016-07-02 0 1.00 5.38 2.50 2.2 3.65 5.00 6.55 11.4 ▆▇▃▁▂
shoot_5 2016-07-03 0 1.00 5.84 2.45 2.4 4.60 5.70 6.70 11.5 ▅▇▃▁▂
shoot_6 2016-07-01 0 1.00 5.76 3.36 1.9 3.82 4.60 6.60 13.5 ▇▇▁▂▂
shoot_6 2016-07-02 0 1.00 6.14 3.62 1.0 4.08 4.90 7.50 13.9 ▃▇▂▁▂
shoot_6 2016-07-03 0 1.00 6.20 3.85 0.0 3.95 5.00 8.03 14.1 ▂▇▃▁▂
shoot_7 2016-07-01 1 0.95 6.22 4.15 1.1 3.15 5.00 9.65 15.4 ▇▆▂▃▂
shoot_7 2016-07-02 1 0.95 6.58 4.15 1.6 3.35 5.50 9.90 15.8 ▇▆▂▃▂
shoot_7 2016-07-03 2 0.90 7.09 4.20 1.4 4.07 5.95 10.47 16.2 ▇▇▂▅▁
hoppers 2016-07-01 0 1.00 3.26 3.12 0.0 2.00 2.00 4.00 15.0 ▇▃▁▁▁
hoppers 2016-07-02 0 1.00 2.90 1.62 0.0 2.00 3.00 4.00 6.0 ▅▃▇▅▃
hoppers 2016-07-03 0 1.00 2.35 3.33 0.0 0.00 2.00 3.25 15.0 ▇▂▁▁▁
leaves 2016-07-01 0 1.00 30.00 0.00 30.0 30.00 30.00 30.00 30.0 ▁▁▇▁▁
leaves 2016-07-02 0 1.00 30.00 0.00 30.0 30.00 30.00 30.00 30.0 ▁▁▇▁▁
leaves 2016-07-03 0 1.00 50.00 0.00 50.0 50.00 50.00 50.00 50.0 ▁▁▇▁▁

Explore data visually

  • visdat::vis_guess() can help spot inconsistencies
  • I’ll change one of the field values to a number to demonstrate
#change field in the 10th row to "1"
tea$field[10] <- 1
#doesn't change the type of the column
class(tea$field)
[1] "character"
#but vis_guess() spots the mistake!
visdat::vis_guess(tea)

Potential mistakes spotted:

  • numeric value in field column
  • a decimal in the hoppers column
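You can also hunt down such entries directly in base R; a quick sketch on an invented vector:

```r
field <- c("A", "B", "1", "A")

# entries that parse as numbers are suspect in a categorical column
suppressWarnings(which(!is.na(as.numeric(field))))  # row 3
```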

Data validation pipelines with pointblank

library(pointblank)
  • pointblank provides 6 (six!) workflows for validating data

  • The Data Quality Reporting Workflow (VALID-1) is probably most useful for this group

  • Start with a data frame, create an “agent”, tell it what to expect of your data with validation functions, and let it “interrogate” your data

  • Output is an HTML table with buttons to download CSV files of any data that didn’t pass your validations

pointblank demo

  1. Decide on “action levels”. You can set a number or fraction of failing test units as the threshold for a warning or error
al <- action_levels(warn_at = 1, stop_at = .05)
al
-- The `action_levels` settings
WARN failure threshold of 1 test units.
STOP failure threshold of 0.05 of all test units.
----
  2. Create an agent
agent <- 
  create_agent(
    tbl = tea, #our data example from before
    actions = al
  )
  3. Specify validation conditions

    • Basic checks on column types with col_is_*() functions
    • Check column values with col_vals_*() functions
    • Check rows (e.g. duplicate rows) with rows_*() functions
agent_informed <- 
  agent %>% 
  col_is_character(c(field, counter)) %>% #should be character
  col_is_date(date) %>% #should be a date
  
  col_vals_in_set(counter, c("W", "G", "E")) %>% #The three researchers
  col_vals_lt(  #expect shoots < 15cm
    columns = starts_with("shoot_"), 
    value =  15,
    na_pass = TRUE
  ) %>% 
  rows_distinct(columns = vars(plant_id, date)) #no duplicate plant_id/date combinations
Important

pointblank historically used the vars() function to specify columns (e.g. col_is_character(vars(field))), but is moving toward a more tidyverse-like approach. However, at the time of writing, some functions and arguments, like rows_distinct() above, still require vars().

  4. Interrogate!
agent_informed %>% interrogate()
Pointblank Validation [2022-07-19|14:12:10]

WARN threshold: 1 failing unit. STOP threshold: 0.05 of all units.

STEP  VALIDATION          COLUMNS          VALUES     UNITS  PASS  FAIL
   1  col_is_character    field                           1     1     0
   2  col_is_character    counter                         1     1     0
   3  col_is_date         date                            1     1     0
   4  col_vals_in_set     counter          W, G, E       60    59     1
   5  col_vals_lt         shoot_1          15            60    60     0
   6  col_vals_lt         shoot_2          15            60    58     2
   7  col_vals_lt         shoot_3          15            60    57     3
   8  col_vals_lt         shoot_4          15            60    60     0
   9  col_vals_lt         shoot_5          15            60    60     0
  10  col_vals_lt         shoot_6          15            60    60     0
  11  col_vals_lt         shoot_7          15            60    57     3
  12  rows_distinct       plant_id, date                 60    60     0

Steps 4, 6, 7, and 11 have failing units, tripping the WARN action level.
Tip

Click the blue “CSV” buttons above to download a .csv file of just the rows that failed that particular validation

Flexible validations

If a validation function you need doesn’t exist, you can use col_vals_expr()

E.g., let’s add a validation to check that hoppers is always a whole number.

agent_informed <-
  agent_informed %>% 
  col_vals_expr(~ hoppers %% 1 == 0) 
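The `%%` (modulo) expression works because whole numbers leave no remainder when divided by 1:

```r
c(3, 3.5, 0) %% 1        # remainders: 0.0 0.5 0.0
c(3, 3.5, 0) %% 1 == 0   # TRUE FALSE TRUE
```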

agent_informed %>% interrogate()
Steps 1-12 match the previous report. The new step:

STEP  VALIDATION      VALUES              UNITS  PASS  FAIL
  13  col_vals_expr   hoppers %% 1 == 0      60    59     1

The one failing unit trips the WARN level.

Check that shoots are growing

The tea shoots should be consistently growing (within measurement error), so we can use col_vals_increasing() to check that shoot height always goes up. We use the segments argument to run this check separately within each plant_id.

agent_informed <-
  agent_informed %>% 
  #just with one shoot for demo purposes, but could use starts_with("shoot_")
  col_vals_increasing(shoot_1, segments = vars(plant_id))

agent_informed %>% interrogate()
Steps 1-13 match the previous report. Steps 14-33 apply col_vals_increasing()
to shoot_1, one step per plant_id segment (3 dated measurements each). Six of
the 20 segments (steps 15, 17, 27, 28, 30, and 33) each have one failing unit,
a shoot recorded shorter than on the previous date, tripping the WARN level.

Publishing validation reports

  • Students, faculty, and staff at University of Arizona have access to RStudio Connect which allows you to publish an RMarkdown document to the web with a single click. (Learn More)

  • Data validation can be automated in a variety of ways. If you are interested in more advanced applications of data validation for your lab, contact us!

Fixing mistakes

  • For true mistakes in data entry (paper -> spreadsheet), it’s probably OK to just edit the raw data

  • For other errors, best practice:

    • Don’t edit raw data!
    • Record all changes to raw data (e.g. by using an R script to make them)
    • Flag observations that have been changed
    • Publish raw data, cleaning steps/scripts, and “cleaned” data
  • Keep an eye out for future workshops on data wrangling/tidying/cleaning
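A minimal sketch of that workflow with dplyr (the toy data and the `flag` column are invented for illustration):

```r
library(dplyr)

# toy "raw" data with one verified entry error (21 should be 2.1)
raw <- data.frame(row = 1:3, shoot_3 = c(1.8, 21, 2.4))

# never overwrite the raw file; apply and flag each fix in a script
cleaned <- raw %>%
  mutate(
    flag    = if_else(row == 2, "fixed missing decimal point", NA_character_),
    shoot_3 = if_else(row == 2, 2.1, shoot_3)
  )
```

Publishing the raw file, this script, and the cleaned file together makes every change auditable.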

Help

Feel free to drop by the CCT Data Science Team office hours, which happen every Tuesday morning. We would love to help you with your questions about data validation, R, and more!

You can also make an appointment with Eric to discuss this content and get troubleshooting help.

References

Scott, Eric R., Ji-Peng Wei, Xin Li, Wen-Yan Han, and Colin M. Orians. 2021. “Differing Non-Linear, Lagged Effects of Temperature and Precipitation on an Insect Herbivore and Its Host Plant.” Ecological Entomology 46 (4): 866–76. https://doi.org/10.1111/een.13023.